In this paper we adapt online estimation strategies to perform
model-based clustering on large networks. Our work focuses on two 
algorithms, the first based on the SAEM algorithm, and the second 
on variational methods. These two strategies are compared with ex- 
isting approaches on simulated and real data. We use the method to 
decipher the connexion structure of the political websphere during 
the US political campaign in 2008. We show that our online EM- 
based algorithms offer a good trade-off between precision and speed, 
when estimating parameters for mixture distributions in the context 
of random graphs. 

1. Introduction. Analyzing networks has become an essential part of a 
number of scientific fields. Examples include such widely differing phenom- 
ena as power grids, protein-protein interaction networks and friendship. In 
this work we focus on particular networks which are made of political We- 
blogs. With the impact of new social network websites like Myspace and 
Facebook, the web has an increasing influence on the political debate. As 
an example, Adamic and Glance (2005) showed that blogging played an im- 
portant role in the political debate of the 2004 US Presidential Election. 
Although only a small minority of Americans actually used these Weblogs, 
their influence extended far beyond their readership, as a result of their in- 
teractions with national mainstream media. In this article we propose to 
uncover the connexion structure of the political websphere during the US 
political campaign in 2008. This data set consists of a one-day snapshot of
130,520 links and 1870 manually classified websites (676 liberal, 1026
conservative and 168 independent), where nodes are connected if there exists
a citation from one to another.

Many strategies have been developed to study network structure and
topology. A distinction can be made between model-free [Newman (2006);
Ng, Jordan and Weiss (2002)] and model-based methods, with connexions
between parametric and nonparametric models [Bickel and Chen (2009)].
Among model-based methods, model-based clustering has provided an
efficient way to summarize complex network structures. The basic idea of
these strategies is to model the distribution of connections in the network, 
considering that nodes are spread among an unknown number of connec- 
tivity classes which are themselves unknown. This generalizes model-based 
clustering to network data, and various modeling strategies have been con- 
sidered. Nowicki and Snijders (2001) propose a mixture model on dyads that 
belong to some relational alphabet, Daudin, Picard and Robin (2008) pro- 
pose a mixture on edges, Handcock, Raftery and Tantrum (2007) consider 
continuous hidden variables and Airoldi et al. (2005, 2007, 2008) consider 
both mixed membership and stochastic block structure. 

In this article our concern is neither to assess nor to compare the
appropriateness of these different models; instead, we focus on a
computational issue that is shared by most of them. Indeed, even if the
modeling strategies are diverse, EM-like algorithms constitute a common core
of the estimation strategy [Dempster, Laird and Rubin (1977); Snijders and
Nowicki (1997)], and these algorithms are known to be slow to converge and
to be very sensitive to the size of the data set. This issue should be put
into perspective with a
new challenge that is inherent to the analysis of network data sets which is 
the development of optimization strategies with a reasonable speed of ex- 
ecution, and which can deal with networks composed of tens of thousands 
of nodes, if not more. To this extent, Bayesian strategies are limited, as 
they may not handle networks with more than a few hundred [Snijders and 
Nowicki (1997); Nowicki and Snijders (2001)] or a few thousand [Airoldi et 
al. (2008)], and heuristic-based algorithms may not be satisfactory from the 
statistical point of view [Newman and Leicht (2007)]. Variational strategies
have been proposed as well [Airoldi et al. (2005); Daudin, Picard and Robin
(2008)], but they suffer from the same limitations as EM. Thus, the
new question we address in this work is "how to perform efficient model-based
clustering from a computational point of view on very large networks or on
networks that grow over time?"

Online algorithms constitute an efficient alternative to classical batch 
algorithms when the data set grows over time. The application of such 
strategies to mixture models has been studied by many authors [Tittering- 
ton (1984); Wang and Zhao (2006)]. Typical clustering algorithms include 
the online k-means algorithm [MacQueen (1967)]. More recently, Liu et al.
(2006) modeled Internet traffic using a recursive EM algorithm for the es- 
timation of Poisson mixture models. However, an additional difficulty of 
mixture models for random graphs is that the computation of Pr{Z|X}, 
the distribution of the hidden label variables Z conditionally on the ob- 
servation X, cannot be factorized due to conditional dependency [Daudin, 
Picard and Robin (2008)]. In this work we consider two alternative strategies 
to deal with this issue. The first one is based on the Monte Carlo simulation 
of Pr{Z|X}, leading to a Stochastic version of the EM algorithm (Stochas- 
tic Approximation EM, SAEM) [Delyon, Lavielle and Moulines (1999)]. The 
second one is the variational method proposed by Daudin, Picard and Robin 
(2008) which consists in a mean-field approximation of Pr{Z|X}. This strat- 
egy has also been proposed by Latouche, Birmele and Ambroise (2008) and 
by Airoldi et al. (2008) in the Bayesian framework. 

In this article we begin by describing the blog database from the 2008 
US presidential campaign. Then we present the MixNet model proposed 
by Daudin, Picard and Robin (2008), and we compare the model with its 
principal competitors in terms of modeling strategies. We use the Sampson 
(1968) data set for illustration. We derive the online framework to estimate 
the parameters of this mixture using SAEM or variational methods. Sim- 
ulations are used to show that online methods are very effective in terms 
of computation time, parameter estimation and clustering efficiency. These 
simulations integrate both fixed-size and increasing size networks for which 
online methods have been designed. Finally, we uncover the connectivity 
structure of the 2008 US Presidential websphere using the proposed varia- 
tional online algorithm of the MixNet model. 

2. Data presentation. In this community extraction experiment, we used 
a data set obtained on November 7, 2007 by the French company RTGI (In- 
formation Networks, Territories and Geography) using a specific methodol- 
ogy similar to Fouetillou (2007). This data set consists of a one-day snapshot 
of over two thousand websites, one thousand of which featured in two online 
directories: http://wonkosphere.com and http://www.politicaltrends.info. 
The first site provides a manual classification, and the second an automatic 
classification based on text analysis. From this seed of a thousand sites, a 
web crawler [Drugeon (2005)] collected a maximum of 100 pages per host- 
name which is in general the sitename. External links were examined to 
check the connectivity with visited and unvisited websites. If websites were 
still unvisited, and if there existed a minimal path of distance less than two 
between a hostname which belongs to the seed and these websites, then the 
web crawler collected them. 
Using this seed-extension method, 200,000 websites were collected, and a
network of websites was created where nodes represent hostnames (a host- 
name contains a set of pages) and edges represent hyperlinks between dif- 
ferent hostnames. Multiple links between two different hostnames were col- 
lapsed into a single link. Intra-domain links were taken into account if host- 
names were not similar. For this web network, we computed an authority 
score [Kleinberg (1999)] and a keyword score TF/IDF [Salton, Wong and 
Yang (1975)] on focused words (political entities) in order to identify respec- 
tively nodes with high-quality websites (high authority scores) and centered 
on those topics (on a political corpus). 870 new websites emerged out of 
these two criteria. They were checked by experts and the validity of the 
seed confirmed. The final tally was 130,520 links and 1870 sites: 676 liberal, 
1026 conservative and 168 independent. The data can be downloaded at 
http://stat.genopole.cnrs.fr/sg/Members/hzanghi.

3. A mixture model for networks. 

3.1. Model and notation. We model the observed network of websites
by a random graph G, where V denotes the set of n fixed vertices, which
represent blogs, and where the random edges represent hyperlinks between
blogs. These random edges are modeled by
$X = \{X_{ij}, (i,j) \in V^2\}$, a set of random variables coding for the
nature of the connection between blogs i and j. The nature of the links can
be discrete or continuous, and we consider a model with distributions
belonging to the exponential family. In the MixNet model we suppose that
nodes are spread among Q hidden classes and we denote by $Z_{iq}$ the
indicator variable such that $\{Z_{iq} = 1\}$ if blog i belongs to class q.
We denote by $Z = (Z_1, \dots, Z_n)$ the vector of random independent label
variables such that

$$Z_i \sim \mathcal{M}(1, \alpha = \{\alpha_1, \dots, \alpha_Q\}),$$

with $\alpha$ the vector of proportions for classes. In the following,
formulas are valid for both directed and undirected networks. Self-loops
have not been introduced, for simplicity of notation, but they have been
implemented in the MixNet software.

Conditional distribution. MixNet is defined using the conditional
distribution of edges given the labels of the nodes. The $X_{ij}$'s are
supposed to be conditionally independent:

$$\Pr\{X \mid Z; \eta\} = \prod_{ij} \prod_{q,l} \Pr\{X_{ij} \mid Z_{iq} Z_{jl} = 1; \eta_{ql}\}^{Z_{iq} Z_{jl}},$$

and $\Pr\{X_{ij} \mid Z_{iq} Z_{jl} = 1; \eta_{ql}\}$ is supposed to belong
to the regular exponential family, with natural parameter $\eta_{ql}$:

$$\log \Pr\{X_{ij} \mid Z_{iq} Z_{jl} = 1; \eta_{ql}\} = \eta_{ql}^t h(X_{ij}) - a(\eta_{ql}) + b(X_{ij}),$$

where $h(X_{ij})$ is the vector of sufficient statistics, a a normalizing
constant and b a given function. Consequently, the conditional distribution
of the graph is also from the exponential family:

$$\log \Pr\{X \mid Z; \eta\} = \sum_{ij,ql} Z_{iq} Z_{jl} \eta_{ql}^t h(X_{ij}) - \sum_{ij,ql} Z_{iq} Z_{jl} a(\eta_{ql}) + \sum_{ij} b(X_{ij}).$$

Examples of such distributions are provided in the Appendix.
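
As an illustration of the exponential-family form above, the Bernoulli case can be sketched in Python. This is a minimal sketch, assuming binary edges without self-loops, for which $\eta_{ql} = \log(\pi_{ql}/(1-\pi_{ql}))$, $h(x) = x$, $a(\eta) = \log(1 + e^{\eta})$ and $b(x) = 0$. The function names and the toy values are illustrative assumptions and are not taken from the MixNet software.

```python
import math


def bernoulli_natural(pi):
    """Natural parameter eta = logit(pi) of a Bernoulli(pi) edge."""
    return math.log(pi / (1.0 - pi))


def log_normalizer(eta):
    """a(eta) = log(1 + exp(eta)) for the Bernoulli family."""
    return math.log1p(math.exp(eta))


def conditional_loglik(X, labels, pi):
    """log Pr{X | Z; eta}, summed over ordered pairs (i, j) with i != j.

    X      : n x n 0/1 adjacency matrix (list of lists)
    labels : labels[i] is the class of node i (Z_iq = 1 iff labels[i] == q)
    pi     : Q x Q matrix of connection probabilities, all in (0, 1)
    """
    n = len(X)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # self-loops are ignored in this sketch
            eta = bernoulli_natural(pi[labels[i]][labels[j]])
            # eta * h(X_ij) - a(eta) + b(X_ij), with h(x) = x and b(x) = 0
            total += eta * X[i][j] - log_normalizer(eta)
    return total


# Toy example with made-up values: 3 nodes, 2 classes.
X = [[0, 1, 0],
     [1, 0, 0],
     [0, 0, 0]]
labels = [0, 0, 1]
pi = [[0.8, 0.1],
      [0.1, 0.6]]
ll = conditional_loglik(X, labels, pi)
```

Each term reduces to $\log \pi_{ql}$ when the edge is present and to $\log(1 - \pi_{ql})$ when it is absent, which is the usual Bernoulli log-likelihood.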

Models comparison. Many strategies have been considered to construct 
models for clustering in networks. Variations mainly concern the nature 
of the link between nodes and the definition of nodes' memberships. For 
instance, the stochastic blockstructure model [Snijders and Nowicki (1997); 
Nowicki and Snijders (2001)] considers links that are dyads (Xij,Xji), whereas 
MixNet considers a model on edges only. Consequently, MixNet implicitly as- 
sumes the independence of Xij and Xji conditionally on the latent structure. 
As for the definition of the label variables, the Mixed Membership Stochastic 
Blockmodel (MMSB) has been proposed to describe the interactions between 
objects playing multiple roles [Airoldi et al. (2008)]. Consequently, the hid- 
den variables of their model can stand for more than one group for one node, 
whereas MixNet only considers one label per node. Airoldi et al. (2008) also 
model the sparsity of the network. This could be done in the context of 
MixNet by introducing a Dirac mass on zero for the conditional distribution 
of edges. Differences among approaches also concern the statistical frame- 
work that defines subsequent optimization strategies. The Bayesian setting 
has been a framework chosen by many authors, as it allows the integration 
of prior information and hierarchical structures [Airoldi et al. (2008)]. In
contrast, our approach does not necessarily rely on stochastic strategies,
meaning that each run provides the same set of parameters. However, the
likelihood of mixture models is in general multimodal, which is a problem
for both approaches: in MCMC procedures it leads to potential label
switching issues, and the variational EM may converge to local maxima.

As the model and the statistical frameworks are different, clustering re- 
sults are likely to be very different as well. In order to illustrate our point, 
we deviate from the political blog data and we use the small data set of 
Sampson (1968), which is used in Airoldi et al. (2008). This data set
describes relational data between monks in a monastery ("whom do you like"
data). Figure 1 shows three possible partitionings of this graph: the first one
corresponds to Sampson's observations, the second one is the result of the 
MMSB model as presented in Airoldi et al. (2008), and the third one is 
provided by MixNet. Individual labels are provided in Table 1. As already 
noted by the authors, the MMSB classes overlap with the relational cate- 
gories provided by Sampson. This is not the case for MixNet, which uncovers 
classes of connectivity that show strong inter-connections but very few
intra-connections ($\pi$). Since one link exists when a monk likes another, MixNet
clusters are made of monks that like the same sets of other monks. For in- 
stance, the blue cluster is made of two monks that like each other and that 
like all monks assigned to the green cluster. The monks in the green cluster 
do not seem to like each other, but prefer the monks assigned to the red and 
purple clusters. As a consequence, both approaches provide different infor- 
mation and are very complementary with more modeling possibilities in the 
MMSB framework, due to the mixed membership and the prior information 
integration possibilities. The relevance of MixNet results has been discussed
elsewhere [Picard et al. (2009)], and our aim in this article is not to
compare the models. Our point is rather computational: we aim at providing an
efficient method to perform model-based clustering on large networks. We 
use the MixNet model as a basis for development, but the online framework 
we develop could be applied to the MMSB model as well. 

Fig. 1. Monk data set with different labels: original categories obtained by
Sampson (1968) (Sampson Labels), labels obtained by Airoldi et al. (2008)
(Airoldi Labels) and MixNet labels (MixNet Labels). Estimated block model (B)
for MMSB, and estimated connectivity matrix ($\pi$) for MixNet.

Joint distribution. Since MixNet is defined by its conditional distribution,
we first check that the joint distribution also belongs to the exponential
family. Using notation

$$N_q(Z) = \sum_i Z_{iq},$$

$$H_{ql}(X, Z) = \sum_{ij} Z_{iq} Z_{jl} h(X_{ij}),$$

$$G_{ql}(Z) = \sum_{ij} Z_{iq} Z_{jl} = N_q(Z) N_l(Z),$$

$$\alpha_q = \exp(\omega_q) \Big/ \sum_l \exp(\omega_l),$$

and

$$T(X, Z) = (\{N_q(Z)\}, \{H_{ql}(X, Z)\}, \{G_{ql}(Z)\}),$$

$$\beta = (\{\omega_q\}, \{\eta_{ql}\}, \{-a(\eta_{ql})\}),$$

$$A(\beta) = n \log \sum_l \exp \omega_l,$$

$$B(X) = \sum_{ij} b(X_{ij}),$$

we have the factorization
$\log \Pr\{X, Z; \beta\} = \beta^t T(X, Z) - A(\beta) + B(X)$, which
proves the claim. The sufficient statistics $T(X, Z)$ of the complete-data
model are the number of nodes in the classes $N_q(Z)$, the characteristics
of the between-group links $H_{ql}$ (through function h, which can stand for
the number of between-group links or for the intensity of the connections in
the case of edges with Poisson or Gaussian distributions), and the product of
frequencies between classes $G_{ql}$. In the following we aim at estimating
$\beta$.
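
The sufficient statistics above can be computed directly from a labeled graph. The following Python sketch assumes hard labels (each node belongs to exactly one class), a scalar statistic h (by default h(x) = x, as for Bernoulli or Poisson edges) and no self-loops; the function name and the toy data are illustrative assumptions.

```python
def sufficient_statistics(X, labels, Q, h=lambda x: x):
    """Return (N, H, G) for an adjacency matrix X and hard labels.

    N[q]    : number of nodes in class q
    H[q][l] : sum of h(X_ij) over ordered pairs i != j with i in q, j in l
    G[q][l] : N[q] * N[l], the product of class frequencies
    """
    n = len(labels)
    N = [0] * Q
    for q in labels:
        N[q] += 1
    H = [[0.0] * Q for _ in range(Q)]
    for i in range(n):
        for j in range(n):
            if i != j:  # self-loops are not modeled in this sketch
                H[labels[i]][labels[j]] += h(X[i][j])
    G = [[N[q] * N[l] for l in range(Q)] for q in range(Q)]
    return N, H, G


# Toy example with made-up values: 3 nodes, 2 classes.
X = [[0, 1, 0],
     [1, 0, 0],
     [0, 0, 0]]
labels = [0, 0, 1]
N, H, G = sufficient_statistics(X, labels, Q=2)
```

Here N counts two nodes in the first class and one in the second, and H records the two directed links observed inside the first class.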

3.2. Sufficient statistics and online recursion. Online algorithms are in- 
cremental algorithms which recursively update parameters, using current 
parameters and new observations. We introduce the following notation. Let
us denote by $X^{[n]} = \{X_{ij}\}_{i,j=1}^{n}$ the adjacency matrix of the
data, when n nodes are present, and by $Z^{[n]}$ the associated labels. A
convenient notation in this context is $X_{i,\cdot} = \{X_{ij}, j \in V\}$,
which denotes all the edges related to node i. Note that the addition of one
node leads to the addition of n + 1 potential connections.

The use of online methods is based on the additivity of the sufficient
statistics regarding the addition of a new node. We can show that

$$N_q(Z^{[n+1]}) = N_q(Z^{[n]}) + Z_{n+1,q},$$

$$H_{ql}(X^{[n+1]}, Z^{[n+1]}) = H_{ql}(X^{[n]}, Z^{[n]}) + \xi_{ql}^{[n+1]},$$

$$G_{ql}(Z^{[n+1]}) = G_{ql}(Z^{[n]}) + \zeta_{ql}^{[n+1]},$$

with

$$\xi_{ql}^{[n+1]} = Z_{n+1,q} \sum_{j=1}^{n} Z_{jl} h(X_{n+1,j}) + Z_{n+1,l} \sum_{i=1}^{n} Z_{iq} h(X_{i,n+1}),$$

$$\zeta_{ql}^{[n+1]} = Z_{n+1,q} N_l^{[n]} + Z_{n+1,l} N_q^{[n]} + Z_{n+1,q} I\{q = l\}.$$

Then if we define
$T(X_{n+1,\cdot}, Z^{[n+1]}) = (Z_{n+1,q}, \{\xi_{ql}^{[n+1]}\}, \{\zeta_{ql}^{[n+1]}\})$,
we get

$$T(X^{[n+1]}, Z^{[n+1]}) = T(X^{[n]}, Z^{[n]}) + T(X_{n+1,\cdot}, Z^{[n+1]}). \tag{3.1}$$

Those equations will be used for parameter updates in the online
algorithms.
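
The additivity property can be checked numerically. The sketch below, in Python, adds the nodes of a small toy graph one at a time using the increments $\xi$ and $\zeta$, and compares the result with a batch computation. It assumes hard labels, h(x) = x and no self-loops; the helper names `batch_statistics` and `add_node` and the toy data are hypothetical.

```python
def batch_statistics(X, labels, Q):
    """Batch computation of N_q, H_ql and G_ql, used here only as a check."""
    n = len(labels)
    N = [labels.count(q) for q in range(Q)]
    H = [[0] * Q for _ in range(Q)]
    for i in range(n):
        for j in range(n):
            if i != j:
                H[labels[i]][labels[j]] += X[i][j]
    G = [[N[q] * N[l] for l in range(Q)] for q in range(Q)]
    return N, H, G


def add_node(N, H, G, labels, z_new, row_new, col_new):
    """Update (N, H, G, labels) in place when a node with label z_new arrives.

    row_new[j] = X_{n+1,j} and col_new[i] = X_{i,n+1} for the n existing nodes.
    """
    Q = len(N)
    # zeta increment: uses the class counts N^{[n]} before the new node is added
    for q in range(Q):
        for l in range(Q):
            z_q = int(z_new == q)
            z_l = int(z_new == l)
            G[q][l] += z_q * N[l] + z_l * N[q] + z_q * int(q == l)
    # xi increment: edges leaving and entering the new node
    for j, label_j in enumerate(labels):
        H[z_new][label_j] += row_new[j]
        H[label_j][z_new] += col_new[j]
    N[z_new] += 1
    labels.append(z_new)


# Toy directed graph with made-up values: 4 nodes, 2 classes.
X = [[0, 1, 0, 1],
     [1, 0, 0, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
true_labels = [0, 1, 0, 1]
Q = 2
N = [0] * Q
H = [[0] * Q for _ in range(Q)]
G = [[0] * Q for _ in range(Q)]
labels = []
for k, z in enumerate(true_labels):
    row_new = X[k][:k]
    col_new = [X[i][k] for i in range(k)]
    add_node(N, H, G, labels, z, row_new, col_new)
```

After the loop, the incrementally updated statistics coincide with `batch_statistics(X, true_labels, Q)`, while each update only touches the edges of the incoming node.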

3.3. Likelihoods and online inference. Existing estimation strategies are
based on maximum likelihood, and algorithms related to EM are used for
optimization purposes. The aim is to maximize the conditional expectation
of the complete-data log-likelihood

$$\mathcal{Q}(\beta \mid \beta') = \sum_{Z} \Pr\{Z \mid X; \beta'\} \log \Pr\{X, Z; \beta\},$$

and the main difficulty is that $\Pr\{Z \mid X; \beta'\}$ cannot be
factorized and needs to be approximated [Daudin, Picard and Robin (2008)]. A
first strategy to simplify the problem is to consider a classification
EM-based strategy [Celeux and Govaert (1992)]. In this setting label
variables are considered as nonrandom and are replaced by their prediction
(0/1). This is a generalization of the k-means algorithm in which the
problem of computing $\Pr\{Z \mid X\}$ is set aside. This strategy has been
the subject of a previous work [Zanghi, Ambroise and Miele (2008)]. It is
known to give biased estimates, but is very efficient in terms of
computation time.

As alternatives to this strategy, we propose two approaches. The first is
based on the Stochastic Approximation EM approach [Delyon, Lavielle and
Moulines (1999)], which approximates $\Pr\{Z \mid X\}$ using Monte Carlo
simulations. The second is the so-called variational approach, which
consists of approximating $\Pr\{Z \mid X\}$ by a more tractable distribution
on the hidden variables. In their online versions, these algorithms optimize
$\mathcal{Q}(\beta \mid \beta')$ sequentially, while nodes are added. To
this end, we introduce the notation

$$\mathcal{Q}_{n+1}(\beta \mid \beta^{[n]}) = \sum_{Z^{[n+1]}} \Pr\{Z^{[n+1]} \mid X^{[n+1]}; \beta^{[n]}\} \log \Pr\{X^{[n+1]}, Z^{[n+1]}; \beta\},$$

with [n + 1] being either the number of nodes or the increment of the
algorithm, which are identical in the online context.
4. Stochastic approximation EM for network mixture.

4.1. A short presentation of SAEM. An original way of estimating the
parameters of the MixNet model is to approximate the expectation of the
complete data log-likelihood using Monte Carlo simulations, which
corresponds to the Stochastic Approximation EM algorithm [Delyon, Lavielle
and Moulines (1999)]. In situations where maximizing
$\mathcal{Q}(\beta \mid \beta')$ is not in a simple closed form, the SAEM
algorithm maximizes an approximation $\hat{\mathcal{Q}}(\beta \mid \beta')$
computed using standard stochastic approximation theory such that

$$\hat{\mathcal{Q}}(\beta \mid \beta')^{[k]} = \hat{\mathcal{Q}}(\beta \mid \beta')^{[k-1]} + \rho_k \big( \tilde{\mathcal{Q}}(\beta \mid \beta') - \hat{\mathcal{Q}}(\beta \mid \beta')^{[k-1]} \big), \tag{4.1}$$

where k is an iteration index, $\{\rho_k\}_{k \ge 1}$ a sequence of positive
step sizes and where $\tilde{\mathcal{Q}}(\beta \mid \beta')$ is obtained by
Monte Carlo integration. This is a simulation of the expectation of the
complete log-likelihood using the posterior $\Pr\{Z \mid X\}$. Each
iteration k of the algorithm is broken down into three steps:

Simulation of the missing data. This can be achieved using Gibbs sampling
of the posterior $\Pr\{Z \mid X\}$. The result at iteration number k is
m(k) realizations of the latent class data Z: $(Z(1), \dots, Z(m(k)))$.

Stochastic approximation of $\mathcal{Q}(\beta \mid \beta')$ using equation
(4.1), with

$$\tilde{\mathcal{Q}}(\beta \mid \beta') = \frac{1}{m(k)} \sum_{s=1}^{m(k)} \log \Pr(X, Z(s); \beta). \tag{4.2}$$

Maximization of $\hat{\mathcal{Q}}(\beta \mid \beta')^{[k]}$ with respect to
$\beta$.
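
The stochastic approximation step (4.1) and the Monte Carlo average (4.2) can be sketched as follows in Python. For simplicity, this sketch tracks the value of the approximation at a single fixed $\beta$, so that each quantity is a scalar; the function names, the step-size choice $\rho_k = 1/k$ and the numerical values are illustrative assumptions.

```python
def monte_carlo_average(complete_logliks):
    """Equation (4.2): average of log Pr(X, Z(s); beta) over m(k) simulations."""
    return sum(complete_logliks) / len(complete_logliks)


def sa_update(q_hat_prev, q_tilde, rho):
    """Equation (4.1): move the running approximation toward the new estimate."""
    return q_hat_prev + rho * (q_tilde - q_hat_prev)


# Made-up complete-data log-likelihood values, one simulation per iteration.
simulated = [[-10.0], [-12.0], [-11.0]]
q_hat = 0.0
for k, logliks in enumerate(simulated, start=1):
    q_tilde = monte_carlo_average(logliks)
    q_hat = sa_update(q_hat, q_tilde, rho=1.0 / k)
```

With $\rho_k = 1/k$, the recursion reduces to the running mean of the Monte Carlo estimates, which illustrates how decreasing step sizes average out the simulation noise.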

As regards the online version of the algorithm, the number of iterations k 
usually coincides with n + 1 , the number of nodes of the network. Although 
it is possible to go further in the iterative process to improve the estimates, it 
is rarely necessary since the results obtained with n + 1 iterations are usually 
reliable. This can be explained by the fact that the MixNet model is robust 
to sampling. The information in the network is indeed highly redundant 
and a reliable estimation of the network parameters can be obtained with a 
small sample (a few dozen) of the nodes using a classical batch algorithm. 
When n is large, using an online algorithm with all the nodes is similar to 
performing many iterations of a batch algorithm on a small sample. 

4.2. Simulation of $\Pr\{Z \mid X\}$ in the online context. We use Gibbs
sampling, which is applicable when the joint distribution is not known
explicitly but the conditional distribution of each variable is known. Here
we generate a sequence of Z approaching $\Pr\{Z \mid X\}$ using
$\Pr\{Z_{iq} = 1 \mid X, Z_{\setminus i}\}$, where $Z_{\setminus i}$ stands
for the classes of all nodes except node i. The sequence of samples is a
Markov chain, and the stationary distribution of this Markov chain
corresponds precisely to the joint distribution we wish to obtain. In the
online context, we consider only one simulation to simulate the class of the
last incoming node using

$$\begin{aligned}
\Pr\{Z_{n+1,q} = 1 \mid X^{[n+1]}, Z^{[n]}\}
&= \frac{\Pr\{Z_{n+1,q} = 1, Z^{[n]}, X^{[n+1]}\}}{\sum_{\ell=1}^{Q} \Pr\{Z_{n+1,\ell} = 1, Z^{[n]}, X^{[n+1]}\}} \\
&= \frac{\exp\{\beta^t T(X_{n+1,\cdot}, Z^{[n]}, Z_{n+1,q})\}}{\sum_{\ell=1}^{Q} \exp\{\beta^t T(X_{n+1,\cdot}, Z^{[n]}, Z_{n+1,\ell})\}} \\
&\propto \exp\Big( \omega_q + \sum_{l=1}^{Q} \sum_{j=1}^{n} Z_{jl} \eta_{ql}^t h(X_{n+1,j}) - \sum_{l=1}^{Q} N_l(Z^{[n]}) a(\eta_{ql}) \Big).
\end{aligned}$$
4.3. Computing $\mathcal{Q}(\beta \mid \beta')$ in the online context. As
regards the online version of the SAEM algorithm, the difference between the
old and the new complete-data log-likelihood may be expressed as

$$\log \Pr(X^{[n+1]}, Z^{[n+1]}; \beta) - \log \Pr(X^{[n]}, Z^{[n]}; \beta) = \log \alpha_q + \sum_{l,\, i \le n} Z_{il} \log \Pr\{X_{n+1,i} \mid Z_{n+1,q} Z_{il}\},$$

where the added simulated vertex label is equal to q ($Z_{n+1,q} = 1$).

Recall that in the online framework, the label of the new node has been
sampled from the Gibbs sampler described in Section 4.2. Consequently,
only one possible label is considered in this equation. Then a natural way
to adapt equation (4.1) to the online context is to approximate

$$\mathcal{Q}_{n+1}(\beta \mid \beta^{[n]}) - \mathcal{Q}_{n}(\beta \mid \beta^{[n]})$$

by

$$\log \Pr(X^{[n+1]}, Z^{[n+1]}; \beta) - \log \Pr(X^{[n]}, Z^{[n]}; \beta).$$

Indeed, this quantity corresponds to the difference between the
log-likelihood of the original network and the log-likelihood of the new
network including the additional node. Notice that the larger the network,
the larger its associated complete expected log-likelihood. Thus, this
increment becomes smaller and smaller compared to
$\mathcal{Q}(\beta \mid \beta')$ as n increases. The decreasing step
$\rho_n$ is thus set to one in this online context. We propose the following
update equation for stochastic online EM computation of the MixNet
conditional expectation:

$$\mathcal{Q}_{n+1}(\beta \mid \beta^{[n]}) = \mathcal{Q}_{n}(\beta \mid \beta^{[n]}) + \log \alpha_q + \sum_{l,\, i \le n} Z_{il} \log \Pr\{X_{n+1,i} \mid Z_{n+1,q} Z_{il}\},$$

where $Z_{n+1}$ is drawn from the Gibbs sampler.
4.4. Maximizing $\mathcal{Q}(\beta \mid \beta')$, and parameters update. The
principle of online algorithms is to modify the current parameter estimation
using the information added by a newly available node [n + 1] and its
corresponding connections $X_{n+1,\cdot}$ to the already existing network.
Maximizing $\mathcal{Q}_{n+1}(\beta \mid \beta^{[n]})$ with respect to
$\beta$ is straightforward and produces the maximum likelihood estimates for
iteration [n + 1]. Here we have proposed a simple version of the algorithm
by setting the number of simulations to one (m(k) = 1). In this context,
the difference between $\mathcal{Q}_{n}(\beta \mid \beta^{[n]})$ and
$\mathcal{Q}_{n+1}(\beta \mid \beta^{[n]})$ involves only the terms of the
complete log-likelihood which are a function of node n + 1. Using notation
$\psi_{ql} = \partial a(\eta_{ql}) / \partial \eta_{ql}$, we get

$$\alpha_q^{[n+1]} = \frac{N_q(Z^{[n+1]})}{n + 1},$$

$$\psi_{ql}^{[n+1]} = \frac{H_{ql}(X^{[n+1]}, Z^{[n+1]})}{G_{ql}(Z^{[n+1]})} = \frac{H_{ql}(X^{[n]}, Z^{[n]}) + \xi_{ql}^{[n+1]}}{G_{ql}(Z^{[n]}) + \zeta_{ql}^{[n+1]}},$$

where $(\xi_{ql}, \zeta_{ql})$ were defined in Section 3.2. Notice that
updating the function $\psi_{ql}$ of the parameter of interest is often more
convenient in an online context than directly considering this parameter of
interest. An example of parameter update is given for the Bernoulli and
Poisson cases in the Appendix.

Once all the nodes in the network have been visited (or are known), the
parameters can be further improved and the complete log-likelihood better
approximated by continuing with the SAEM algorithm described above.
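
In the Bernoulli case, $a(\eta) = \log(1 + e^{\eta})$, so that $\psi_{ql} = e^{\eta_{ql}}/(1 + e^{\eta_{ql}}) = \pi_{ql}$ and the update gives the connection probabilities directly. The following Python sketch illustrates this parameter update from the current sufficient statistics; the function name, the handling of empty classes and the toy values are assumptions made for illustration.

```python
def update_parameters(N, H, G):
    """Return (alpha, pi) from the sufficient statistics after n + 1 nodes.

    alpha[q] = N_q / (n + 1) and, for Bernoulli edges, pi[q][l] = H_ql / G_ql.
    """
    Q = len(N)
    n_nodes = sum(N)
    alpha = [N[q] / n_nodes for q in range(Q)]
    # An empty class gives G_ql = 0; the ratio is set to 0.0 by convention here.
    pi = [[H[q][l] / G[q][l] if G[q][l] > 0 else 0.0 for l in range(Q)]
          for q in range(Q)]
    return alpha, pi


# Made-up statistics for 3 nodes and 2 classes.
N = [2, 1]
H = [[2, 0], [0, 0]]
G = [[4, 2], [2, 1]]
alpha, pi = update_parameters(N, H, G)
```

Because N, H and G are updated additively when a node arrives, this step costs a number of operations that depends only on the number of classes.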

5. Application of online algorithm to variational methods. Variational
methods constitute an alternative to SAEM. Their principle is to approximate
the intractable distribution $\Pr\{Z \mid X; \beta\}$ by a newly introduced
distribution on Z denoted by $\mathcal{R}$. Then this new distribution is
used to optimize $\mathcal{J}(X, \mathcal{R}(Z); \beta)$, an approximation
(lower bound) of the incomplete-data log-likelihood
$\log \Pr\{X; \beta\}$, defined such that

$$\mathcal{J}(X, \mathcal{R}(Z); \beta) = \log \Pr\{X; \beta\} - KL(\mathcal{R}(Z), \Pr\{Z \mid X; \beta\}),$$

with $KL(\cdot \mid \cdot)$ being the Kullback-Leibler divergence between
probability distributions [Jordan et al. (1999)]. Then one must choose the
form of $\mathcal{R}$, and the product of Multinomial distributions is
natural in the case of MixNet, with
$\log \mathcal{R}(Z) = \sum_i \sum_q Z_{iq} \log \tau_{iq}$ and the
constraint $\sum_q \tau_{iq} = 1$. In this case, the form of
$\mathcal{J}(X, \mathcal{R}(Z); \beta)$ is

$$\begin{aligned}
\mathcal{J}(X, \mathcal{R}(Z); \beta)
&= \sum_{Z} \mathcal{R}(Z; \tau) \log \Pr\{X, Z; \beta\} - \sum_{Z} \mathcal{R}(Z; \tau) \log \mathcal{R}(Z; \tau) \\
&= \mathcal{Q}(\tau, \beta) + \mathcal{H}(\mathcal{R}(Z; \tau)),
\end{aligned}$$

with $\mathcal{Q}(\tau, \beta)$ an approximation of the conditional
expectation of the complete-data log-likelihood, and
$\mathcal{H}(\mathcal{R}(Z; \tau))$ the entropy of the approximate posterior
distribution of Z.
The implementation of variational methods in online algorithms relies on
the additivity property of $\mathcal{J}(X, \mathcal{R}(Z); \beta)$ when
nodes are added. This property is straightforward:
$\mathcal{Q}(\tau, \beta)$ is additive thanks to equation (3.1) [because
$\mathcal{R}(Z)$ is factorized], and $\mathcal{H}(\mathcal{R}(Z; \tau))$ is
also additive, since the hidden variables are supposed independent under
$\mathcal{R}$ and the entropy of independent variables is additive. The
variational algorithm is very similar to an EM algorithm, with the E-step
being replaced by a variational step which aims at updating variational
parameters. Then a standard M-step follows. In the following, we give the
details of these two steps in the case of a variational online algorithm.

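5.1. Online variational step. When a new node is added, it is necessary
to compute its associated variational parameters $\{\tau_{n+1,q}\}_q$. If we
consider all the other $\tau_{iq}$ for i < n + 1 as known, the
$\{\tau_{n+1,q}\}_q$ are obtained by differentiating the criterion

$$\mathcal{J}(X, \mathcal{R}(Z); \beta) + \sum_{i=1}^{n+1} \lambda_i \Big( \sum_{q=1}^{Q} \tau_{iq} - 1 \Big),$$

where the $\lambda_i$ are the Lagrangian parameters. Since function
$\mathcal{J}$ is additive according to the nodes, the calculation of its
derivative with respect to $\tau_{n+1,q}$ gives

$$\omega_q^{[n]} + \sum_{l=1}^{Q} \sum_{j=1}^{n} \tau_{jl}^{[n]} \big( \eta_{ql}^{[n]} h(X_{n+1,j}) - a(\eta_{ql}^{[n]}) \big) - \log \tau_{n+1,q} - 1 + \lambda_{n+1} = 0.$$

This leads to

$$\tau_{n+1,q} \propto \alpha_q^{[n]} \exp\Big\{ \sum_{l=1}^{Q} \sum_{j=1}^{n} \tau_{jl}^{[n]} \big( \eta_{ql}^{[n]} h(X_{n+1,j}) - a(\eta_{ql}^{[n]}) \big) \Big\} \qquad \forall q \in \{1, \dots, Q\}. \tag{5.1}$$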
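
For Bernoulli edges, the variational update (5.1) can be sketched in Python as follows. As in the Gibbs step, $\eta_{ql} h(x) - a(\eta_{ql})$ is written as $x \log \pi_{ql} + (1 - x) \log(1 - \pi_{ql})$; the hard labels are simply replaced by the variational parameters of the existing nodes. This is a minimal sketch assuming an undirected graph and probabilities strictly between 0 and 1; the function name and the toy values are hypothetical.

```python
import math


def new_node_tau(x_new, tau, alpha, pi):
    """Variational parameters tau_{n+1,q} of a new node, following (5.1).

    x_new : x_new[j] = X_{n+1,j} in {0, 1}, for each existing node j
    tau   : tau[j][l], variational parameters of the n existing nodes
    alpha : current class proportions
    pi    : current Q x Q connection probabilities, all in (0, 1)
    """
    Q = len(alpha)
    log_t = []
    for q in range(Q):
        s = math.log(alpha[q])
        for x, tau_j in zip(x_new, tau):
            for l in range(Q):
                p = pi[q][l]
                s += tau_j[l] * (x * math.log(p) + (1 - x) * math.log(1.0 - p))
        log_t.append(s)
    # normalize so that the tau_{n+1,q} sum to one
    m = max(log_t)
    w = [math.exp(v - m) for v in log_t]
    total = sum(w)
    return [v / total for v in w]


# Toy example with made-up values; tau is one-hot here, i.e. hard labels.
tau = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
alpha = [0.5, 0.5]
pi = [[0.8, 0.1],
      [0.1, 0.6]]
tau_new = new_node_tau([1, 1, 0], tau, alpha, pi)
```

When the existing $\tau_{jl}$ are one-hot, the result coincides with the conditional probabilities used by the Gibbs step, which makes the parallel between the two strategies explicit.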



5.2. Maximization/update step. To maximize the approximated expectation
of the complete log-likelihood with respect to $\beta$, we solve

$$\frac{\partial \mathcal{Q}_{n+1}(\tau, \beta)}{\partial \beta} = \mathbb{E}_{\mathcal{R}^{[n]}} \Big( \frac{\partial \log \Pr\{X^{[n+1]}, Z^{[n+1]}; \beta\}}{\partial \beta} \Big) = 0. \tag{5.2}$$

Solving equation (5.2) for the parameters $\{\omega_q\}$ gives the
following update equation:

$$\alpha_q^{[n+1]} = \frac{1}{n+1} \Big( \sum_{i=1}^{n} \tau_{iq}^{[n]} + \tau_{n+1,q} \Big).$$

The other update equation is obtained by considering parameters
$\{\eta_{ql}\}$, and using notation $\psi_{ql}$, which gives

$$\psi_{ql}^{[n+1]} = \frac{\mathbb{E}_{\mathcal{R}^{[n]}}(H_{ql}(X^{[n+1]}, Z^{[n+1]}))}{\mathbb{E}_{\mathcal{R}^{[n]}}(G_{ql}(Z^{[n+1]}))}.$$

Thanks to equation (3.1), which gives the relationships between sufficient
statistics at two successive iterations, parameters can be computed
recursively using the update of the expectation of the sufficient
statistics, such that

$$\mathbb{E}_{\mathcal{R}^{[n]}}(N_q(Z^{[n+1]})) = \mathbb{E}_{\mathcal{R}^{[n]}}(N_q(Z^{[n]})) + \mathbb{E}_{\mathcal{R}^{[n]}}(Z_{n+1,q}),$$

$$\mathbb{E}_{\mathcal{R}^{[n]}}(H_{ql}(X^{[n+1]}, Z^{[n+1]})) = \mathbb{E}_{\mathcal{R}^{[n]}}(H_{ql}(X^{[n]}, Z^{[n]})) + \mathbb{E}_{\mathcal{R}^{[n]}}(\xi_{ql}^{[n+1]}),$$

$$\mathbb{E}_{\mathcal{R}^{[n]}}(G_{ql}(Z^{[n+1]})) = \mathbb{E}_{\mathcal{R}^{[n]}}(G_{ql}(Z^{[n]})) + \mathbb{E}_{\mathcal{R}^{[n]}}(\zeta_{ql}^{[n+1]}).$$

An example of parameters update is given in the Appendix for both the
Bernoulli and the Poisson distributions. Note the similarity of the formula
compared with the SAEM strategy. Hidden variables Z are either simulated
or replaced by their approximated conditional expectation (variational
parameters).

6. Experiments.

Motivations. Experiments are carried out to assess the trade-off
established by online algorithms in terms of quality of estimation and speed
of execution. We propose a two-step simulation study. We first report
simulation experiments using synthetic data generated according to the
assumed random graph model. In this first experiment we use a simple
affiliation model to check precisely the quality of the estimations given by
the online algorithms. Results are compared to the batch variational EM
proposed by Daudin, Picard and Robin (2008) to assess the effect of the
online framework on the estimation quality and on the speed of execution. In
a second step, we use a real data set from the web as a starting point to
simulate growing networks with complex structure, and to assess the
performance of online methods on this type of network. An ANSI C++
implementation of the algorithms is available at
http://stat.genopole.cnrs.fr/software/mixnet/, as well as an R package named
MixeR (http://cran.r-project.org/web/packages/mixer/), along with public
data sets. This software is currently used by the Constellations online
application (http://constellations.labs.exalead.com/), which instantaneously
extracts and visually explores the connectivity information induced by
hyperlinks between the first hits of a given search request, taking
advantage of the MixNet algorithm to reveal it.
6.1. Comparison of algorithms.

Simulations set-up. We simulate affiliation models with $\lambda$ and
$\epsilon$ being the within- and between-group probabilities of connection,
respectively. Five models are considered (Table 2). We set
$\lambda = 1 - \epsilon$ to reduce the number of free parameters, with
parameter $\lambda$ controlling the complexity of the model. Differences
between models lie in their modular structure, which varies from no
structure (almost the Erdős-Rényi model) to strong modular structure (low
inter-module connectivity and strong intra-module connectivity, or strong
inter-module connectivity and low intra-module connectivity). Figure 2
illustrates three kinds of connectivity, corresponding to models 1, 4 and 5.
For each affiliation model we generate graphs with $Q \in \{2, 5, 20\}$
groups mixed in the same proportions 1/Q. The number of nodes n varies in
{100, 250, 500, 750, 1000, 2000} to explore different sizes of graphs. We
generate a total of 45 graph models, each being simulated 30 times.
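
A graph from such an affiliation model can be simulated as in the following Python sketch. It assumes an undirected graph without self-loops and labels drawn uniformly among the Q groups; the function name, the seed and the chosen sizes are illustrative assumptions and do not reproduce the exact simulation protocol of the experiments.

```python
import random


def simulate_affiliation(n, Q, lam, eps, seed=0):
    """Simulate an undirected affiliation graph.

    Two nodes are connected with probability lam if they share a group
    and with probability eps otherwise.
    """
    rng = random.Random(seed)
    labels = [rng.randrange(Q) for _ in range(n)]
    X = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = lam if labels[i] == labels[j] else eps
            if rng.random() < p:
                X[i][j] = X[j][i] = 1  # symmetric adjacency matrix
    return X, labels


# Model 1 of Table 2 (eps = 0.3, lam = 0.7) with n = 100 nodes and Q = 5 groups.
X, labels = simulate_affiliation(n=100, Q=5, lam=0.7, eps=0.3, seed=1)
```

Changing `lam` and `eps` moves the simulated graph between the modular regimes described above, with `lam == eps` giving a graph without group structure.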

Criteria of comparison. The comparison between algorithms is done using
the bias $\mathbb{E}(\hat{\epsilon} - \epsilon)/\epsilon$ and the mean
square error $\mathbb{V}(\hat{\epsilon})$ to reflect the variability of the
estimators. We also use the adjusted Rand Index [Hubert and Arabie (1985)]
to evaluate the agreement between the estimated and the actual partitions.
Computing this index is based on a ratio between the number of node pairs
belonging to the same and to different classes when considering the actual
partition and the estimated partition. It typically lies between 0 and 1 (it
equals 0 in expectation for random partitions and can be slightly negative),
two identical partitions having an adjusted Rand Index equal to 1.
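
The adjusted Rand Index can be computed from the contingency table of the two partitions. The Python sketch below follows the standard pair-counting formula of Hubert and Arabie (1985); the function name and the convention used for the degenerate case are assumptions made for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two partitions of the same nodes."""
    n = len(labels_a)
    cells = Counter(zip(labels_a, labels_b))   # contingency table n_ij
    rows = Counter(labels_a)                   # row sums a_i
    cols = Counter(labels_b)                   # column sums b_j
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions have a single class
    return (sum_cells - expected) / (max_index - expected)


# Identical partitions up to a relabeling of the classes.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

The index is invariant to a permutation of the class labels, which is why it is suited to comparing an estimated partition with the simulated one.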

Table 2

Parameters of the five affiliation models considered in the experimental setting

Model           ε         λ
1             0.3       0.7
2            0.35      0.65
3             0.4       0.6
4             0.5       0.5
5             0.9       0.1

Fig. 2. Top left: low inter-module connectivity and strong intra-module connectivity
(model 1). Top right: strong inter-module connectivity and low intra-module
connectivity (model 5). Bottom center: Erdős-Rényi model (model 4).

Algorithms set-up. In a first step we compare algorithms that are based
on maximum likelihood estimation (MLE). The online SAEM and online
variational methods we propose are compared with the variational method
proposed in Daudin, Picard and Robin (2008) (batch MixNet in the sequel).
We also add an online classification version (online CEM) to the comparison,
since this strategy has been shown to reduce the computational cost as well
[Zanghi, Ambroise and Miele (2008)]. To avoid initialization issues, each
algorithm is started with the same strategy: multiple initialization points are
proposed and the best result is selected based on its likelihood. The number
of clusters is chosen using the Integrated Classification Likelihood criterion,
as proposed in Daudin, Picard and Robin (2008). The algorithms are stopped
when the parameters are stable between two consecutive iterations. In a
second step, we compare the MLE-based algorithms with other competitors
such as spectral clustering [Ng, Jordan and Weiss (2002)] and a k-means-like
algorithm [Newman (2006)].
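
The initialization and stopping rules described above are generic and can be sketched independently of any particular estimation algorithm. In the illustration below, `update` (one iteration of an algorithm) and `log_likelihood` are hypothetical callables supplied by the user, the parameters are assumed to be a flat list of floats, and the tolerance `tol` is an arbitrary choice; none of this is taken from the MixNet implementation.

```python
def run_until_stable(update, theta0, tol=1e-6, max_iter=1000):
    """Iterate `update` until no parameter moves by more than `tol`."""
    theta = list(theta0)
    for _ in range(max_iter):
        new_theta = update(theta)
        if max(abs(a - b) for a, b in zip(new_theta, theta)) < tol:
            return new_theta
        theta = new_theta
    return theta


def best_of_restarts(update, log_likelihood, starts, tol=1e-6):
    """Run the algorithm from every starting point, keep the most likely fit."""
    fits = [run_until_stable(update, start, tol) for start in starts]
    return max(fits, key=log_likelihood)
```

Using the same set of starting points for every algorithm, as in the experiments, keeps the comparison from being driven by initialization alone.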

Table 3

Bias (in percent) and Root Mean Square Errors (×10^3) for the parameter estimators in
the five affiliation models. The Q modules are mixed in the same proportion. Each model
considers n = 500 nodes and Q = 5 groups

                Online-SAEM  Online-variational          Online-CEM        Batch-MixNet
Model       B%(ε)     B%(λ)     B%(ε)     B%(λ)     B%(ε)     B%(λ)     B%(ε)     B%(λ)
1           -0.14      0.04     -0.13      0.04     -0.13      0.04     -0.13      0.04
2            0.23     -1.01      0.04     -0.11     -0.03      0.01     -0.03      0.00
3            9.47    -26.38      8.83    -24.32      6.44    -22.46     -0.01     -0.11
4            1.11     -4.29      0.16     -0.35      3.00     -4.32      0.05     -0.01
5           -0.01     -0.02     -0.01     -0.02     -0.01     -0.02     -0.01     -0.02

Model     RMSE(ε)   RMSE(λ)   RMSE(ε)   RMSE(λ)   RMSE(ε)   RMSE(λ)   RMSE(ε)   RMSE(λ)
1            1.45      2.25      1.42      2.25      1.45      2.25      1.45      2.25
2            1.89      4.04      1.65      2.90      1.63      2.90      1.63      2.90
3            5.19     14.75      6.95     22.32     13.89     25.96      2.14      6.74
4            3.75     10.42      1.33      1.67      8.21     15.71      1.25      1.62
5            0.92      1.73      0.92      1.73      0.93      1.73      0.92      1.73

Table 4

Means and standard deviations of the Rand Index for all models with Q and n fixed

                Online-SAEM  Online-variational          Online-CEM        Batch-MixNet
Model        rand    σ_rand      rand    σ_rand      rand    σ_rand      rand    σ_rand
1            0.98      0.02      0.98      0.02      0.98      0.02      0.99      0.02
2            0.96      0.07      0.97      0.07      0.97      0.07      0.98      0.01
3            0.13      0.13      0.10      0.15      0.25      0.16      0.85      0.14
4            0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
5               1      0.00         1      0.01         1      0.01         1      0.01

Estimator bias and RMSE (Table 3). A first result is that every algorithm
provides estimators with negligible bias (of the order of 1% or lower) and
variance for highly structured models (models 1, 2, 5, Table 3). The online
framework shows its limitations when the structure of the network is less
pronounced (model 3), as every online method shows a significant bias and
low precision, whereas the batch MixNet behaves well. This limitation was
expected, as the gain in computational burden has an impact on the
complexity of structures that can be identified. Finally, among online
versions of the algorithm, the online variational method provides the best
results on average in terms of bias and precision.
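
The two quantities reported in Table 3 can be computed from the replicated estimates of a parameter as sketched below. The function names are hypothetical, and the sketch assumes that the root mean square error is taken around the true parameter value over the simulated replicates.

```python
from math import sqrt


def bias_percent(estimates, true_value):
    """Relative bias E(estimate - true) / true, in percent."""
    mean_estimate = sum(estimates) / len(estimates)
    return 100.0 * (mean_estimate - true_value) / true_value


def rmse(estimates, true_value):
    """Root mean square error of the estimates around the true value."""
    squared_errors = [(e - true_value) ** 2 for e in estimates]
    return sqrt(sum(squared_errors) / len(estimates))
```

For example, `bias_percent(estimates, 0.3)` applied to the 30 estimates of ε obtained for model 1 would give one entry of the B%(ε) column.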

Quality of partitions (Table 4). We also focus on the Rand Index for each
algorithm. Indeed, even if a poor estimation of λ goes with a small Rand
Index (Table 4), good estimates do not always lead to correctly estimated
partitions. An illustration is given with model 4, for which algorithms can
produce good estimates with a poor Rand Index, due to the nonmodular
structure of the network. As expected, the performance increases with the
number of nodes (Table 5).

Computational efficiency (Table 5). Since the aim of online methods is
to provide computationally efficient algorithms, the performance mentioned
above should be put in perspective with the speed of execution of each
algorithm. Indeed, Table 5 shows the strong gain of speed provided by online
methods compared with the batch algorithm. The execution time is divided by
more than 80 on networks with 2000 nodes, for instance. Table 5 also
shows that there is no significant difference in the speed of execution among
online methods. Since the online variational method provides the best results
in terms of estimation precision, with no significant difference from the
other methods in partition quality or speed, this is the algorithm retained
in what follows.

Comparison with other algorithms (Table 6). The above results show 
that a strong case may be made for the online variational algorithm when 
choosing between alternative clustering methods. Consequently, we shall now 
compare it with two suitable "rivals" for large networks: a basic spectral 
clustering algorithm [Ng, Jordan and Weiss (2002)], and one of the popular 
community detection algorithms [Newman (2006)]. The spectral clustering 
algorithm searches for a partition in the space spanned by the eigenvectors 
of the normalized Laplacian, whereas the community detection algorithm 
looks for modules which are defined by high intra-connectivity and low inter- 
connectivity. 



Table 5

Means and standard deviations of the Rand Index with speed of the algorithms, Q = 5,
model 2

                Online-SAEM  Online-variational          Online-CEM        Batch-MixNet
n            rand    σ_rand      rand    σ_rand      rand    σ_rand      rand    σ_rand
100          0.15      0.04      0.15      0.07      0.15      0.05      0.19      0.09
250          0.50      0.09      0.55      0.11      0.51      0.01      0.95      0.07
500          0.62      0.09      0.62      0.11      0.65      0.14         1      0.00
750          0.84      0.03      0.85      0.03      0.84      0.04         1      0.00
1000         0.94      0.01      0.95      0.01      0.92      9.37         1      0.00
2000         0.98      0.00      0.98      0.01      0.98      0.01         1      0.00

n            time    σ_time      time    σ_time      time    σ_time      time    σ_time
100          0.09      0.00      0.09      0.00      0.09      0.00      0.10      0.00
250          1.31      0.01      1.32      0.01      1.31      0.00      3.18      0.01
500          1.41      0.01      1.46      0.01      1.41      0.01     49.46      0.13
750          3.45      0.02      3.57      0.02      3.44      0.02    251.32      0.75
1000         9.46      0.41      9.61      0.43      9.37      0.40    805.92      0.49
2000       157.31      1.28    158.21      1.41    157.12      2.08  13051.10     73.75

Table 6

Means and standard deviations of the Rand Index for the five models computed over 30
different runs for graph clustering competitors and variational algorithms

        Community detection Spectral clustering  Online-variational
Model        rand    σ_rand      rand    σ_rand      rand    σ_rand
1            1.00      0.00      0.97      0.14      1.00      0.00
2            0.99      0.01      0.98      0.00      1.00      0.00
3            0.97      0.02      0.97      0.00      1.00      0.00
4            0.00      0.00      0.00      0.00      0.00      0.00
5            0.00      0.00      0.92      0.19      1.00      0.00



For our five models with arbitrarily fixed parameters n = 1000 and Q = 3,
we ran these algorithms and computed the Rand Index for each of them. From
Table 6 we see that our online variational algorithm always produces the
best clustering of nodes.

We generated networks using the MixNet data generating process. Thus,
these results correspond to what may be expected on networks that display
a blockmodel structure: the online variational algorithm always yields the
best node classification. Apart from model 4, one may also remark that
the spectral algorithm is fairly efficient, with a slight bias, and is more
reliable overall than the community algorithm, the latter failing
completely when applied to model 5. Although the community algorithm
appears less well adapted to these experiments, we shall see in the next
section that this algorithm is particularly suitable when partitioning
data sets whose nodes are densely interconnected.

6.2. Realistic networks growing over time. In this section we use a real
network as a template to simulate a realistic complex structure. For this
purpose, we use a French political blogosphere data set that consists
of a sample of 196 political blogs from a single-day snapshot. This network
was automatically extracted on October 14, 2006 and manually classified by
the "Observatoire Presidentielle" project. This project is the result of a
collaboration between RTGI SAS and Exalead and aims at analyzing the French
presidential campaign on the web. In this data set, nodes represent
hostnames (a hostname contains a set of pages) and edges represent
hyperlinks between different hostnames. If several links exist between two
different hostnames, we collapse them into a single one. Note that
intra-domain links can be considered if hostnames are not identical.
Finally, in this experiment we consider that edges are not oriented, which
is not realistic but does not affect the interpretation of the groups. Six
known communities compose this network: Gauche (French Democrat), Divers
Centre (Moderate party), Droite (French Republican), Ecologiste (Green),
Liberal (supporters of economic liberalism) and, finally, Analysts. The
data is provided within the MixeR package. This network presents an
interesting organization due to the existence of several political parties
and commentators. This complex connectivity pattern is highlighted by the
MixNet parameters given in Figure 3.

Fig. 3. MixNet results on the French political blogosphere, displayed with the
organic layout of Cytoscape (Shannon et al. 2003). The table gives the
probabilities (×100) of connection between the 11 selected clusters [using a
penalized likelihood criterion described in Daudin, Picard and Robin (2008)].
Dots in the table correspond to connections lower than 1%.

As the algorithm is motivated by large data sets, we use the parameters
given by MixNet to generate networks that grow over time. We use the
French blogosphere to generate a realistic network structure as a starting
point. We simulate 200-node networks from this model, then we iterate by
simulating the growth over time of these networks according to the same
model, and we use the online algorithm to update parameters sequentially.
The result is striking: even on very large networks with ~13,000 nodes and
~13,000,000 edges, the online algorithm allows us to estimate mixture
parameters with negligible classification error in ~6 minutes (Table 7).
This is the only algorithmic framework that makes it possible to perform
model-based clustering on networks of that size.
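
The growth mechanism used in this experiment can be sketched as follows: new nodes are drawn from the same mixture model and connected to all nodes already present. The function name `grow_network` is hypothetical, the graph is treated as undirected as in the rest of this section, and the proportions `alpha` and connectivity matrix `pi` below are illustrative values, not the MixNet estimates of Figure 3.

```python
import random


def grow_network(labels, edges, n_new, alpha, pi, rng):
    """Append n_new nodes to a block-model network (modified in place).

    labels: group labels of the nodes already present
    edges:  set of undirected edges stored as (i, j) with i < j
    alpha:  group proportions; pi[q][l]: connection probability
            between a node of group q and a node of group l
    """
    groups = list(range(len(alpha)))
    for _ in range(n_new):
        new = len(labels)
        q = rng.choices(groups, weights=alpha)[0]
        # The new node draws one potential edge to every existing node.
        for j, l in enumerate(labels):
            if rng.random() < pi[l][q]:
                edges.add((j, new))
        labels.append(q)
    return labels, edges


rng = random.Random(0)
alpha = [0.5, 0.5]                 # illustrative proportions
pi = [[0.8, 0.1], [0.1, 0.8]]      # illustrative connectivity matrix
labels, edges = grow_network([], set(), 200, alpha, pi, rng)   # 200 nodes
for _ in range(2):                                             # 200 + 200, 400 + 400
    labels, edges = grow_network(labels, edges, len(labels), alpha, pi, rng)
```

In an online setting, the parameter update would be applied after each new node and its edges arrive, rather than after the whole network has been generated.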

7. Application to the 2008 US Presidential WebSphere. Since its creation,
and all the more with its recent social aspect (Web 2.0), the World Wide
Web has been the space where individuals use Internet technologies to talk,
discuss and debate. Such a space can be seen as a directed graph where the
pages and hyperlinks are respectively represented by nodes and edges. Many
studies of this graph, like Broder et al. (2000), have introduced the key
properties of the Web structure. However, this section rather focuses
on local studies by considering that the Web is formed by territories and
communities with their own conversation leaders and participants [Ghitalla
et al. (2003)]. Here, we define a territory as a group of websites concerned
with the same topic and a community as a group of websites in the same
territory which may share the same opinion or the same link connectivity.
One usually assumes that the existence of a hyperlink between two pages
implies that they are content-related [Kleinberg (1999); Davison (2000)]. By
exploring the links exchanged between pages, one can actually draw the
borders of web territories/communities.

Table 7

Quality of the clustering procedure in terms of Rand Index when the network grows over
time. Each configuration has been simulated 100 times

# nodes (previous + new)     Ave. edges    Ave. rand    Ave. cpu time (s)
200                             3131.72         0.94                  0.9
200 + 200                     50,316.32        0.998                  0.4
400 + 400                     12,486.24        0.999                  1.4
800 + 800                     201,009.5            1                  5.7
1600 + 1600                   803,179.6            1                 22.8
3200 + 3200                   3,202,196            1                 91.9
6400 + 6400                  12,804,008            1                371.1

Comparison with a community detection algorithm. A first step consists
in comparing the results of MixNet with those of the community detection
algorithm proposed by Newman (2006). If the political classification is
used as a reference, the community algorithm produces better agreement,
with a Rand Index of 0.59, compared with a Rand Index of 0.25 for MixNet
(see Table 8). However, this comparison favors Newman's algorithm, because
the two methods have different objectives. Indeed, the community algorithm
aims at finding modules which are defined by high intra-connectivity and
low inter-connectivity. Given that websites tend to link to one another in
line with political affinities, the link topology corresponding to the
manual classification naturally favors the community module definition. The
objective function can also help to explain the community algorithm's
suitability for this data set, since the quality of a partition in terms of
Newman's modules is measured by the modularity, which the algorithm
maximizes. The modularity is a scalar between −1 and 1 that measures the
density of links inside communities as compared to links between
communities [Newman (2006)]. When applying both algorithms to our political
network with Q = 3, the online variational algorithm yields a modularity of
0.20, whereas the community algorithm yields a modularity of 0.30, which is
close to the manual partition modularity of 0.28. As MixNet classes do not
necessarily take the form of modules, one might expect our approach to
yield a modularity index that is not "optimal." Nevertheless, the two class
definitions are complementary, and both are needed in order to give a
global overview of a network: the community partition to detect dense node
connectivity, and the MixNet partition to analyze nodes with similar
connectivity profiles. However, as mentioned by Adamic and Glance (2005),
the division between liberal and conservative blogs is "unmistakable"; this
is why it may be more interesting to uncover the structure of the two
communities rather than to detect them.
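
As an illustration of the modularity discussed above, the sketch below computes it for a given partition of an undirected graph, using the standard form: the sum over communities of the fraction of edges inside the community minus the squared fraction of edge endpoints attached to it. The function name and input format are choices made for this example, and the network is assumed undirected with each edge listed once.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity of a partition of an undirected simple graph.

    edges:     iterable of (i, j) pairs, each undirected edge listed once
    community: dict mapping every node to its community label
    """
    m = 0
    intra = defaultdict(int)     # number of edges inside each community
    degree = defaultdict(int)    # total degree of each community
    for i, j in edges:
        m += 1
        degree[community[i]] += 1
        degree[community[j]] += 1
        if community[i] == community[j]:
            intra[community[i]] += 1
    if m == 0:
        raise ValueError("modularity is undefined for a graph without edges")
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)
```

A partition that puts every node in a single class has modularity 0, which gives a reference point for the values 0.20, 0.28 and 0.30 quoted above.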

Table 8

Contingency table comparing the political partition and MixNet partition

             Conservative   Independent   Liberal
Cluster 1             734           135       238
Cluster 2             290            26         8
Cluster 3               2             7       430

Fig. 4. Network summary of US political websites. Each vertex represents a cluster. Each
pie chart gives the proportions of liberal, conservative and independent tagged websites in
the cluster. The outer ring color of the vertices is proportional to the intensity of the
intra-connectivity: the darker, the weaker. Edges are represented when the
inter-connectivity is among the largest 20% of all connectivity values.

Interpreting MixNet results. MixNet first confirms what was already
mentioned by Adamic and Glance (2005): the political websphere is
partitioned according to political orientations. In addition, MixNet
highlights the role of the main US online portals as the core of this
websphere (Figure 4, C17). Political communities do not directly cite their
opponents but communicate through nytimes.com, washingtonpost.com, cnn.com
or msn.com, for instance (in C17). This central structure has two main
implications: it confirms the political cyberbalkanization trend that was
already observed in 2004, and it emphasizes the role of mass media websites
as political referees. Moreover, the connectivity pattern estimated by the
model shows a particular affinity between the mass-media cluster and
liberal thought, as connections are stronger toward the liberal part of the
weblogs (Table 9).

The question is then to determine the structural characteristics of the
liberal and conservative territories (note that independent sites do
not seem to be structured on their own). MixNet reveals a hierarchical
organization of political sub-spheres, with weblogs playing a determining
role in the structuring of the liberal community: reachm.com, mahablog.com
and juancole.com (C20), which are well known to be at the core of the
liberal debate on the web. This results in a set of clusters (C7, C8, C12,
C13 and C20) that show very strong intra- and inter-group connectivities
and nearly form a clique (Table 9). The balkanization is also observed
within territories, as radical positions, like in the feministe.us website
(C6), are only spread through core websites (π_{20,6} = 99%, for instance).
A last level of hierarchy is made up of liberal blogs that show
intermediate connections within the same liberal territory.

Interestingly, this subdivision is also present in the conservative part of
the network, with very famous websites like foxnews.com (C14) being at the
center of the debate. Indeed, clusters C3, C14, C16, C18 and C19 constitute
the core of the conservative websphere, and clusters C1 and C2 are very
lightly connected with other conservative blogs. The difference lies in the
intensity of connection, which is lower for the conservatives.
Table 9

Estimated π (in percentage) and number of nodes in each cluster for the US political
websphere; d represents the estimated mean degree of each group. Clusters with
probabilities of connection lower than 1% are not represented (clusters 3, 5, 15)

Fig. 5. Boxplot of the betweenness (in log) of MixNet classes.

Compared with available methods that can analyze networks of such size
(like community detection), MixNet shows structures of the political
websphere that are more complex than the expected liberal/conservative
split. The model highlights the structural similarities that exist between
spheres of political opponents. Both communities are characterized by a
small set of sites which use the internet in a very professional and
efficient way, with a lot of cross-linking. This results in a core
structure to which other sites are linked, these other sites being less
efficient in citing other websites. This could be explained either by a
tendency to ignore other elements in the debate or by a use of the internet
which is less efficient. Interestingly, this structure is very similar
between conservatives and liberals, with the liberal core being tighter.
For the liberal blogs, this observation can result from a better
understanding of their Web ecosystem. This interpretation is reinforced by
the different betweenness centralities of MixNet classes. Betweenness is
based on the number of shortest (geodesic) paths that pass through a
vertex. Figure 5 shows that betweenness is higher for MixNet core classes
on average in both political structures, whereas the betweenness patterns
of the liberals and conservatives look very similar.
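
To make the betweenness measure concrete, the sketch below computes it for every node of an unweighted, undirected graph with Brandes' accumulation scheme. This is one standard way to obtain the quantity, given here for illustration; the source does not state which implementation was used, and the adjacency-dictionary input is an assumption of this example.

```python
from collections import deque


def betweenness(adj):
    """Betweenness centrality for an unweighted, undirected graph.

    adj: dict mapping each node to an iterable of its neighbours.
    Returns, for each node, the number of shortest paths between other
    node pairs that pass through it (fractional when several shortest
    paths tie).
    """
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths.
        dist = {s: 0}
        n_paths = {v: 0 for v in adj}
        n_paths[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies, farthest nodes first.
        dep = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                dep[v] += n_paths[v] / n_paths[w] * (1.0 + dep[w])
            if w != s:
                score[w] += dep[w]
    # Every unordered pair has been visited from both of its endpoints.
    return {v: b / 2.0 for v, b in score.items()}
```

Averaging these values (or their logarithms, as in Figure 5) over the nodes of each MixNet class gives a per-class summary of how often a class sits on shortest paths.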

8. Conclusion. In this paper we propose online versions of estimation
algorithms for random graphs which are based on a mixture of distributions.
These strategies allow the estimation of model parameters within a
reasonable computation time for data sets which can be made up of thousands
of nodes. These methods constitute a trade-off between the amount of data
to process and the quality of the estimates: even if online methods are not
as precise as batch methods for estimation, they may represent a solution
when the size of the network is too large for any existing estimation
strategy. Furthermore, our simulation study shows that the quality of the
resulting partition is good when using online methods. In the network of
2008 US political websites, we could uncover the structure that makes up
the political websphere. This structure is very different from classical
modules or "communities," which highlights the need for efficient
computational strategies to perform model-based clustering on large graphs.
The online framework is very flexible, and could be applied to other models
such as the block model and the mixed membership model, as the online
framework can be adapted to Bayesian algorithms [Opper (1999)].



APPENDIX 

A.1. Examples of distributions for the exponential family. We provide
some examples of common distributions that can be used in the context
of networks. For example, when the only available information is the
presence or the absence of an edge, then X_ij is assumed to follow a
Bernoulli distribution. Writing the conditional log-density as
η_ql h(X_ij) − a(η_ql) + b(X_ij), we have

$$
X_{ij} \mid Z_{iq} Z_{jl} = 1 \sim \mathcal{B}(\pi_{ql})
\quad\Longrightarrow\quad
\begin{cases}
\eta_{ql} = \log \dfrac{\pi_{ql}}{1 - \pi_{ql}}, \\
h(X_{ij}) = X_{ij}, \\
a(\eta_{ql}) = -\log(1 - \pi_{ql}), \\
b(X_{ij}) = 0.
\end{cases}
$$

If additional information is available to describe the connections between 
vertices, it may be integrated into the model. For example, the Poisson 
distribution might describe the intensity of the traffic between nodes. A 
typical example in web access log mining is the number of users going from 
a page i to a page j. Another example is provided by co-authorship networks, 
for which valuation may describe the number of articles commonly published 
by the authors of the network. In those cases, we have 

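$$
X_{ij} \mid Z_{iq} Z_{jl} = 1 \sim \mathcal{P}(\lambda_{ql})
\quad\Longrightarrow\quad
\begin{cases}
\eta_{ql} = \log \lambda_{ql}, \\
h(X_{ij}) = X_{ij}, \\
a(\eta_{ql}) = \lambda_{ql}, \\
b(X_{ij}) = -\log(X_{ij}!).
\end{cases}
$$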
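
The exponential-family parametrizations of the Bernoulli and Poisson cases can be checked numerically. The sketch below assumes the convention log f(x) = η h(x) − a(η) + b(x), which is the usual one but is an assumption here, and the function names are hypothetical. It evaluates both log-densities through a single generic function.

```python
from math import exp, factorial, log


def log_density(x, eta, h, a, b):
    """Exponential-family log-density: eta * h(x) - a(eta) + b(x)."""
    return eta * h(x) - a(eta) + b(x)


def bernoulli_log_pmf(x, pi):
    eta = log(pi / (1.0 - pi))                # natural parameter (logit)
    return log_density(
        x, eta,
        h=lambda v: v,
        a=lambda e: log(1.0 + exp(e)),        # equals -log(1 - pi)
        b=lambda v: 0.0,
    )


def poisson_log_pmf(x, lam):
    eta = log(lam)                            # natural parameter
    return log_density(
        x, eta,
        h=lambda v: v,
        a=lambda e: exp(e),                   # equals lam
        b=lambda v: -log(factorial(v)),
    )
```

Exponentiating `bernoulli_log_pmf(1, pi)` returns `pi` up to rounding error, and `poisson_log_pmf` agrees with the usual Poisson probability mass function.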
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A.2. Parameters update in the Bernoulli and Poisson cases for the online 
SAEM. The estimator becomes 
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C^ +1] = Z^JVKZM) + Z n+1>l N q (zW) + Z n+1 , ? % = I}. 

A.3. Parameters update in the Bernoulli and the Poisson cases for the 
online variational algorithm. We get the following update equation: 

n+l] _ [n+1] [n}. ( , [n+lk%»% 



where 
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